The High Time Resolution Universe (HTRU 2) data set was originally published by Robert Lyon under the CC BY 4.0 license. It reports pulsar candidates collected during the HTRU survey. Pulsars are rapidly rotating neutron stars of considerable scientific interest. Candidates must be classified into pulsar and non-pulsar classes to aid discovery.
More information about the HTRU survey can be found here. To learn more about other pulsar surveys, have a look at this article. You may also read the original publication.
Each record is described by eight continuous variables and a single class variable. The first four are simple statistics (mean, standard deviation, excess kurtosis, and skewness) obtained from the integrated pulse profile, an array of continuous variables describing a longitude-resolved version of the signal that has been averaged in both time and frequency. The remaining four are the same statistics obtained from the dispersion measure versus signal-to-noise ratio (DM-SNR) curve. These are summarized below:
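To make the feature definitions concrete, the four statistics can be computed from a raw curve with base R. This is a minimal sketch (the `profile_stats` helper and the toy input vector are hypothetical; the kurtosis reported in the data set is the excess kurtosis, which is zero for a normal distribution):

```r
# Sketch: the four summary statistics for one curve (hypothetical helper)
profile_stats <- function(x) {
  m <- mean(x)
  s <- sd(x)                           # sample standard deviation
  z <- (x - m) / s                     # standardized values
  c(mean            = m,
    std_dev         = s,
    excess_kurtosis = mean(z^4) - 3,   # 0 for a normal distribution
    skewness        = mean(z^3))
}

profile <- c(5, 7, 9, 11, 40)          # toy stand-in for an integrated profile
round(profile_stats(profile), 2)
```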
HTRU 2 Summary:
as_tibble(head(dat, 10))
summary(dat)
## IP_mean IP_std_dev IP_excess_kurtosis IP_skewness
## Min. : 5.8 Min. :24.8 Min. :-1.88 Min. :-1.8
## 1st Qu.:100.9 1st Qu.:42.4 1st Qu.: 0.03 1st Qu.:-0.2
## Median :115.1 Median :46.9 Median : 0.22 Median : 0.2
## Mean :111.1 Mean :46.5 Mean : 0.48 Mean : 1.8
## 3rd Qu.:127.1 3rd Qu.:51.0 3rd Qu.: 0.47 3rd Qu.: 0.9
## Max. :192.6 Max. :98.8 Max. : 8.07 Max. :68.1
## DM_SNR_mean DM_SNR_std_dev DM_SNR_excess_kurtosis DM_SNR_skewness
## Min. : 0.2 Min. : 7.4 Min. :-3.1 Min. : -2
## 1st Qu.: 1.9 1st Qu.: 14.4 1st Qu.: 5.8 1st Qu.: 35
## Median : 2.8 Median : 18.5 Median : 8.4 Median : 83
## Mean : 12.6 Mean : 26.3 Mean : 8.3 Mean : 105
## 3rd Qu.: 5.5 3rd Qu.: 28.4 3rd Qu.:10.7 3rd Qu.: 139
## Max. :223.4 Max. :110.6 Max. :34.5 Max. :1191
## class
## 0:16259
## 1: 1639
In the following paragraphs, we attempt to determine an appropriate model for predicting the outcome on out-of-sample data. The algorithms to be trained with the caret R package, each internally resampled by train(), are: a generalized linear model (GLM), k-nearest neighbors (kNN), a random forest (RF), a support vector machine with a radial kernel (SVM), and regularized discriminant analysis (RDA).
Finally, an ensemble of these models will be crafted by majority vote. NOTE: Please be patient; training all the models will probably take a while.
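The data split assumed by the training blocks below can be sketched as follows (the 80/20 split ratio and the seed are assumptions; note that by default caret's train() uses 25 bootstrap resamples, and the trainControl object shown here would have to be passed via the trControl argument to switch to 10-fold cross-validation instead):

```r
library(caret)

# Sketch (assumed setup): hold out 20% of the data for testing
set.seed(1)                            # hypothetical seed
idx       <- createDataPartition(dat$class, p = 0.8, list = FALSE)
train_set <- dat[idx, ]
test_set  <- dat[-idx, ]

# Optional: 10-fold cross-validation instead of the default 25 bootstrap
# resamples; pass as train(..., trControl = ctrl)
ctrl <- trainControl(method = "cv", number = 10)
```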
# Generalized Linear Model
fit_glm <- train(class ~ ., method = "glm", data = train_set)
y_hat_glm <- predict(fit_glm, test_set)
acc_glm <- confusionMatrix(y_hat_glm, factor(test_set$class))$overall[["Accuracy"]]
# k-Nearest Neighbors
fit_knn <- train(class ~ .,
method = "knn",
data = train_set,
tuneGrid = data.frame(k = seq(3, 15, 2)))
y_hat_knn <- predict(fit_knn, test_set)
acc_knn <- confusionMatrix(y_hat_knn, factor(test_set$class))$overall[["Accuracy"]]
# Random Forest
set.seed(10, sample.kind = "Rounding") # if using R 3.6 or later
fit_rf <- train(class ~ .,
method = "rf",
data = train_set,
nodesize = 1,
tuneGrid = data.frame(mtry = seq(2, 10, 2)))
imp_rf <- varImp(fit_rf)
y_hat_rf <- predict(fit_rf, newdata = test_set)
acc_rf <- confusionMatrix(y_hat_rf, factor(test_set$class))$overall[["Accuracy"]]
# Support Vector Machine
fit_svm <- train(class ~ ., method = "svmRadial", data = train_set)
y_hat_svm <- predict(fit_svm, test_set)
acc_svm <- confusionMatrix(y_hat_svm, factor(test_set$class))$overall[["Accuracy"]]
# Regularized Discriminant Analysis
fit_rda <- train(class ~ ., method = "rda", data = train_set)
y_hat_rda <- predict(fit_rda, test_set)
acc_rda <- confusionMatrix(y_hat_rda, factor(test_set$class))$overall[["Accuracy"]]
Let us learn more about the data characteristics from the trained models.
From the kNN model plot, we observe that the number of neighbors has little impact on the model's accuracy on the training set. A similar observation holds for the accuracy of the Random Forest model as a function of the number of predictors sampled at each split (mtry). However, the Random Forest variable importance plot indicates that a substantial portion of the outcome can be predicted from a fairly small number of variables, chiefly the excess kurtosis, skewness, and mean of the integrated profile.
Let us inspect the out-of-sample performance of our models to compare their outputs.
At first glance, the histogram seems to indicate very similar out-of-sample performances for each model trained. Let us compare their performance statistics:
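The comparison can also be drawn directly from the mean resampling accuracies (a sketch; the numbers are the rounded means from summary(models_compare) below, reproduced here for illustration):

```r
# Sketch: mean resampling accuracies (rounded values from summary(models_compare))
accs <- c(GLM = 0.979, KNN = 0.973, RF = 0.980, SVM = 0.980, RDA = 0.975)
barplot(accs, ylim = c(0.95, 1.0), xpd = FALSE,
        ylab = "Mean resampling accuracy")
```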
models_compare <- resamples(list(
GLM = fit_glm,
KNN = fit_knn,
RF = fit_rf,
SVM = fit_svm,
RDA = fit_rda
))
summary(models_compare)
##
## Call:
## summary.resamples(object = models_compare)
##
## Models: GLM, KNN, RF, SVM, RDA
## Number of resamples: 25
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## GLM 0.977 0.978 0.980 0.979 0.980 0.983 0
## KNN 0.967 0.972 0.974 0.973 0.975 0.976 0
## RF 0.978 0.979 0.980 0.980 0.980 0.983 0
## SVM 0.976 0.979 0.980 0.980 0.981 0.983 0
## RDA 0.972 0.974 0.975 0.975 0.976 0.978 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## GLM 0.853 0.861 0.870 0.868 0.874 0.893 0
## KNN 0.791 0.821 0.830 0.828 0.839 0.851 0
## RF 0.857 0.866 0.874 0.873 0.880 0.890 0
## SVM 0.847 0.867 0.872 0.871 0.880 0.891 0
## RDA 0.818 0.833 0.838 0.838 0.845 0.856 0
A summary of the models provides further evidence that they perform similarly despite belonging to different families of algorithms. In other words, given that our models (linear, prototype-based, tree-ensemble, kernel-based, and discriminant analysis) exhibit equivalent accuracy values, the choice among them for classifying pulsar survey data in an academic setting may be driven by time efficiency.
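Given the strong class imbalance (about 91% non-pulsars), the Kappa column is the more telling one, since it corrects accuracy for chance agreement. As a sketch, Cohen's kappa can be computed directly from a 2x2 confusion table; the counts used here are taken from the ensemble confusion matrix reported further below:

```r
# Sketch: Cohen's kappa from a 2x2 confusion matrix
cohens_kappa <- function(tab) {
  n  <- sum(tab)
  po <- sum(diag(tab)) / n                      # observed agreement (accuracy)
  pe <- sum(rowSums(tab) * colSums(tab)) / n^2  # agreement expected by chance
  (po - pe) / (1 - pe)
}

# Counts from the ensemble confusion matrix below
tab <- matrix(c(3227, 25, 54, 274), nrow = 2,
              dimnames = list(Prediction = c("0", "1"),
                              Reference  = c("0", "1")))
round(cohens_kappa(tab), 3)  # 0.862, matching the ensemble Kappa below
```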
Let us now build an ensemble with the trained models and inspect its performance:
rows <- seq_along(test_set$class)
y_hat <- sapply(rows, function(i) {
  votes <- ifelse(y_hat_glm[i] == 1, 1, 0) +
    ifelse(y_hat_knn[i] == 1, 1, 0) +
    ifelse(y_hat_rf[i] == 1, 1, 0) +
    ifelse(y_hat_svm[i] == 1, 1, 0) +
    ifelse(y_hat_rda[i] == 1, 1, 0)
  ifelse(votes >= 3, 1, 0)  # majority vote: at least 3 of 5 models predict pulsar
}) %>% factor()
confusionMatrix(y_hat, test_set$class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3227 54
## 1 25 274
##
## Accuracy : 0.978
## 95% CI : (0.973, 0.982)
## No Information Rate : 0.908
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.862
##
## Mcnemar's Test P-Value : 0.00163
##
## Sensitivity : 0.992
## Specificity : 0.835
## Pos Pred Value : 0.984
## Neg Pred Value : 0.916
## Prevalence : 0.908
## Detection Rate : 0.901
## Detection Prevalence : 0.916
## Balanced Accuracy : 0.914
##
## 'Positive' Class : 0
##
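The row-wise voting above can also be expressed as a reusable, vectorized function (a sketch; majority_vote is a hypothetical helper assuming each prediction vector is a factor with levels "0" and "1"):

```r
# Sketch: vectorized majority vote over any odd number of prediction factors
majority_vote <- function(...) {
  preds <- list(...)
  votes <- sapply(preds, function(p) as.integer(as.character(p) == "1"))
  factor(ifelse(rowSums(votes) >= ceiling(length(preds) / 2), 1, 0),
         levels = c(0, 1))
}

# With the trained models this reproduces the ensemble above:
# y_hat <- majority_vote(y_hat_glm, y_hat_knn, y_hat_rf, y_hat_svm, y_hat_rda)
```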
From our results, the shape of the integrated pulse profile, as captured by its summary statistics, is the main factor determining whether the signal measured by the detector corresponds to a pulsar or to background noise. Moreover, the choice of classification algorithm has little impact on accuracy, and the ensemble does not improve the output much, so in an academic setting the choice of algorithm can be driven by time efficiency.